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In this paper, the work dene by Rowe in [Ref. 1] and 
[Ref. 2] is combined in the implementation of a rule based 
system to produce upper and lower bounds and estimates as to 
the size of intersections and unions of simple sets. The 
system constructed for this paper uses the hierarchy of 
knowledge levels of [Ref. 2 ] as the tabulated statistics in 
the database abstract. The system is tested and analyzed to 
determine the advantages a-'d disadvantages of increasing the 
knowledge level of the database abstract used for the calcu- 
lation and of varying the number of partitions used in 
constructing the database abstracts. The paper also 

discusses the implementation details encountered during the 
construction of this system in the PROLOG programming 
language. 
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I. BACKGROOND AND THEORY 



A. HOTIVATIOH FOR DEVELOPING A NEB QUERY METHOD 

Traditional sequential methods of dealing with statis- 
tical queries on numerical databases involve the inspection 
of each data element within the database to determine if 
that element satisfies the queried condition. For a large 
database of thousands of items, such a search results in 
poor response time and tremendous use of machine resources 
regardless of the size of the queried set with respect to 
the size of the database. Even when indexing techniques are 
employed to provide more direct access to indexed data items 
within the database, query response time can be unsatisfac- 
torily slow when the database is large and is located on 
numerous pages in a virtual paging system. The primary 
reason for this slow response is that if the database is 
randomly distributed on the queried attribute the data items 
required by the query will be dispersed throughout the data- 
base and many of the pages containing the database will have 
to be transferred into primary storage to retrieve the data. 
For databases which are extremely large, such as the U. S. 
Census Bureau database, such a delay can be extremely long 
and quite unsatisfactory. One solution to this problem, 
suggested in [Ref. 1], involved the use of an additional 
file of information about the database called the database 
abstract. Indeed, Rowe also suggested that access to the 
database itself might even be unnecessary if sufficient 
information could be stored in the database abstract to 
answer the user's queries within a satisfactory tolerance. 
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B. DEFIHITION OF A DATABASE ABSTRACT 



The concept of a database abstract should not be that 
unfamiliar to the reader as it is ^uite similar to that of a 
mathematical equivalence class. Recall that in mathematics 
when an equivalence relation is defined over a domain, all 
elements which are equivalent to each other can be qrouped 
into a set called an equivalence class. If the original 
domain were a group (in the mathematical sense) closed under 
a certain operation, then the collection of equivalence 
classes defined by the given equivalence relation will also 
be a group (called the quotient group), when a new operation 
is appropriately defined in terms of the original operation. 
In the context of a database, the equivalence relation 
appears in the form of a partition of the database on one of 
its attributes. All data items which are within a partic- 
ular partition with respect to the partitioning attribute 
are considered to be equivalent and thus the database 
abstract is similar to a 'quotient database'. As such the 
database abstract contains information about the equivalence 
classes rather than the individual data items. The most 
difficult question resulting from this approach is the 
problem of how to capture as much of the information 
contained in the individual data items of the equivalence 
class as possible, in the most compact storage area 
possible, when tabulating the information to be recorded 
about the equivalence class. 

For the purposes of this paper, we will consider a 
sample database of medical information about patients in a 
particular hospital. Each data item consists of a six-tuple 
containing the patient number, sex, disease activity level, 
temperature, cholesterol level and prednisone level of a 
particular patient. Thus each data element in this database 
contains six attributes. If we define a partition of the 



database on the sex attribute, we obtain two equivalence 
classes, one containing all tuples with male for the value 
of the sex attribute and the other all tuples with female 
for the value of the sex attribute. Our new database 
abstract contains only two data items. Obviously we cannot 
record a disease activity, temperature, patient number, etc 
for the equivalence classes because they are a different 
type of entity than the tuples of which they are composed. 
We can, however, record the mean value (just as an example) 
for disease activity, temperature, etc. for all the tuples 
in each equivalence class and still retain some portion of 
the information which is contained in the individual tuples. 

This is the basic idea of the database abstract except 
that the database abstract utilizes multiple equivalence 
relations to formulate orthogonal equivalence classes over 
each of the attributes in the relation. For example, in 
addition to the partition discussed above over the sex 
attribute, the database can be partitioned over the tempera- 
ture attribute. All patients with temperatures less than 
35.5 are considered to have low temperatures, those between 
35.5 and 36.5 inclusive to have normal temperatures and 
those above 36.5 to have high temperatures. Under this 
equi alence relation, the database is partitioned into three 
parts. All tuples in each partition are equivalent with 
respect to this relation. We can record statistical infor- 
mation about all the equivalent tuples in each partition and 
attempt to capture as much information about these tuples as 
possible. In like manner the database can be partitioned 
over each of its attributes whether they be numerical or 
non-numerical and thus the final database abstract can be 
thought of as an orthogonal collection of equivalence 
classes with certain selected statistical information 
recorded about each equivalence class in an attempt to 
capture many of the character istics of the database in a 
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much smaller storage area than the actual database would 
contain. 

Even though attributes may be partitioned into eguiva- 
lence classes regardless of whether they are numerical or 
non-numerical, there is an inherent difference in the flexi- 
bility of the partitioning process between these two types 
of attributes. For instance, in our example above 
concerning the temperature attribute, we may just as easily 
have decided to divide the database into low temperatures, 
low normal, high normal and high temperatures instead of the 
three partitions defined above. This new partitioning would 
simply involve a redefinition of the boundaries of the 
partitions. In fact, the database could be divided into any 
number of partitions between two and the number of distinct 
values for temperature in the database depending upon the 
desires of the user. Now consider the non-numeric attri- 
bute, sex. It must be partitioned one and only one way 
because it has only two distinct values. Therefore during 
this partitioning process, we must handle numeric and non- 
numeric attributes in a slightly different manner. 

The above discussion has raised two important questions 
which we must consider in constructing the database 
abstract. First, for the numeric attributes, how many 
partitions should be chosen for each attribute to maximize 
the amount of information about the original database which 
is captured in the database abstract while still minimizing 
the storage space required to store the database abstract. 
This question will be analyzed in considerable depth in 
chapter three. The second question concerns what informa- 
tion should be recorded in the database abstract about each 
equivalence class in crder to again maximize the real infor- 
mation content while still keeping the size of the database 
abstract as small as possible. 
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C. THE INFORHATION CONTENT OF THE DATABASE ABSTRACT 



In [Ref. 1], that author included 16 statistics for each 
equivalence class in his database abstract in order to be 
able to service a large number of different queries on 
various statistics. In this discussion, we restrict 
ourselves to queries on the size statistic in order to more 
easily direct our attention to other questions about the 
formulation of the database abstract. Here the size 
statistic refers to the number of database tuples which fall 
into a particular set defined by the user. The barious 
methods to define these query sets will be discussed later 
in this chapter. For the information content of the equiva- 
lence classes constituting our database abstract, we have 
used the five levels of knowledge defined in [Ref. 2]. 
These levels represent increasing quantities of information 
and storage requirements and result in increasing accuracy 
of query estimation results. 

As discussed in [Ref. 2 ], the first level of knowledge 
contains only the size of each equivalence class. Therefore 
this database abstract would contain the size of each parti- 
tion of each attribute in the database. The total informa- 
tion content wou ’ dep^t.id on the number of partitions chosen 
for each of the imerical attributes in the database. The 
second level of knowledge contains in addition to the size 
information, the mode frequency and number of distinct 
values of each attribute in every equivalence class in the 
database. Therefore, if each of the five numerical attri- 
butes in our medical database were divided into four parti- 
tions and the non-numerical sex attribute into two 
partitions this database abstract would contain 2* (4*5*5 + 
2*5) or 220 additional data items. This level of knowledge 
is equivalent to that referred to as level 2a in [Ref- 2]. 
The level three database abstract for this system 



corresponds to the level 3a abstract of [Ref. 2] and 
includes the frequency of the second most frequent value and 
the frequency of the median frequency value for each attri- 
bute in each equivalence class and therefore adds an addi- 
tional 220 data items to the database abstract file. level 
four is a logical extension of levels two and three and 
contains a list of the frequency of the most frequent 
through the kth most frequent value for each attribute in 
each equivalence class in the database where k represents 
the number of distinct values of the given attribute in that 
equivalence class. Finally, the highest level of knowledge 
is the tagged frequency distribution list of level five. 
This database abstract contains a list of the values of each 
attribute in each equivalence class together with the 
frequency of occurrence of that value in the given equiva- 
lence class. This is the highest level of knowledge in the 
context of this hierarchy because it represents the size of 
all first level intersections cf all sets in the database 
and as such corresponds to level one knowledge if we had 
considered these first level intersections as the primitive 
sets in the database. 

The choice of these particular items of information is 
not a hard and fast requirement for a system of this type. 
Other items of information may be substituted for those 
included above. For example, instead of the second most 
frequent value in level three, the least frequent or anti- 
mode frequency of each attribute in each equivalence class 
could be chosen. The important requirement for the purposes 
of this system is that the levels of knowledge represent a 
sequence of increasing information content in the database 
so that we can demonstrate an increasing accuracy of query 
estimate results and weigh these benefits against a greater 
storage requirement and a decreasing speed of execution as 
higher level database abstracts and their correspondingly 
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r information 



answer a query is the c 
the given i out conditi 
contains a more abstra 
information content of 



more complex rul > are u od. In fact, 
bes as frequenc inform cion could concei. .y be used to 
capture the information content of each eg ^valence class, 
however the frequency information discussed above seems to 
fit more appropriately into a well defined hierarchy than 
any other types of information considered. 

Now that the information content of the various levels 
of knowledge has been described, an appropriate operation on 
that information must be defined in order to gain the best 
possible estimate of the size of any queried set. Note that 
the basic operation or the original database in order to 

ration of search and comparison to 
3. Because the database abstract 
and concise representation of the 
the original database, the basic 
search operation will no longer suffice. A new operation 
must be defined in terms of the base operation in a well 
defined manner to produce accurate estimates to user's 
queries in much the same manner that a new operation is 
defined on a quotient group in terms of the original opera- 
tion. The term operation is used quite loosely in this 
context be use it actually takes the form of a database of 
rules whi extract the queried information from the infor- 
mation content of the database abstract. The search and 
comparison operation is used to locate and apply the appro- 
priate rule for a given set of input conditions in the form 
of a query. The following chapter will describe this data- 
base abstract operation by describing the ru"' as used in the 
prototype system constructed to implement sue,* a query esti- 
mation system. 
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D. HYPOTHESES TO BE CONSIDERED 



The remainder of this paper will describe a system which 
calculates upper and lower bound and estimates for the size 
of a given set using only the information contained in the 
database abstract. Various characteristics of the database 
abstracts will be varied and the system will be tested to 
determine what effect these characteristics have on the 
system responses. At this point it is appropriate to 
hypothesize about the expected effects of varying these 
characteristics. 

First of all it is expected that using higher levels of 
knowledge will produce tighter upper and lower bounds but 
will also result in increased system response times and 
greater storage requirements for the database abstract. It 
is also suspected that there is one level which is the best 
to use in the sense that it causes much tighter upper and 
lower bounds than the next lower level but results in only a 
modest increase in the response time and storage required 
over the next lower level. Through our testing we will 
attempt to determine if such a 'best* level exists and if 
so, which of our levels it corresponds to. 

The second hypothesis for the system is that as the 
number of partitions of the database abstract increases, the 
system will produce tighter upper and lower bounds but will 
suffer from longer response times and larger storage 
requirements for the database abstract. It is also believed 
that there exists an optimum granularity for the databse 
abstract in the sense that increasing the number of parti- 
tions above that granularity will result in only slightly 
tighter bounds but will result in constant increases in 
system response times and storage required for the database 
abstract. Through our testing we will attempt to determine 
what this optimum granularity is for a sample database and 
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what charactac teristics of that database determined this 
optimum granularity. 

The third and final hypothesis for this system concerns 
the method that the user uses to input his query set. In 
order to make the system as general purpose as possible the 
user is allowed to enter his sets by specifying ranges of 
attributes. For example, using the sample medical database 
utilized throughout this paper, if the user wishes to know 
the size of the set of all patient with temperatures between 
36 and 37 who also have a cholesterol level between 230 and 
250 he may specify the query set: 

temp (36,37) * chol (230, 250) . 

This method has the advantage cf being very general purpose 
but suffers the penalty of less accurate bounds generated by 
the system as will be seen in chapter three. If the user 
queries the actual partition sets of the database abstract, 
the bounds produced by the system will be much tighter, 
however, the user may have no need to know the size of such 
sets. As a solution to this conflict between tighter bounds 
versus a more general query method, we submit that for many 
databases, there exist predefined ranges of attribute values 
which are queried often by the user community. If such 
ranges were forced to be actual partition sets during the 
generation of the database abstract, a significant improve- 
ment in the tightness of calculated bounds could be achieved 
when such ranges were queried by the user. Testing on such 
a specially constructed database abstract will be conducted 
to determine if the expected benefits are in fact as great 
as expected. 
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II- I HP1.EME HTAT I0M OF THE DATABLE ABSTB ACT JiOEBY ESTIMWION 

SYSTEM 



A. GEHEBAL IHFOBHATIOH 

The previous chapter introduced the operation on the 
database abstract equivalence classes as a database of rules 
which extract the queried information from the primitive 
information stored in the database abstract. This database 
of rules implies that the most natural method to implement 
such a system would be as a rule-based production system 
where the system must decide which rules apply to the given 
query and apply only those rules to produce the most accu- 
rate result. Such a prototype rule-based system was imple- 
mented as a part of this project again utilizing the medical 
database described in chapter one. This chapter will 
describe in some detail the structure and organization of 
this system (to be called DAQDES/ for Database Abstract 
QUery Estimation System for the remainder of this paper). 

The first major decision to be considered before 
commencing the construction of this system was the program- 
ming language to utilize. After considering both Lisp and 
PRDLOG, the advantage of faster speed of execution because 
of the Lisp compiler seemed tc be surpassed by the more 
natural syntax and excellent pattern matching capabilities 
of PfiCLOG. Because this system is only a prototype system, 
speed of execution was not considered to be as important as 
the ease of understanding the logic of the rules in the rule 
base provided by the PROLOG language. Additionally, this 
author also believes that the power of logic programming is 
just now being recognized and such languages may very well 
become the principal ones for artificial intelligence 
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programming applications in the very near future. It seems 
advantageous for computer scientir 3 outside of Japan to 
begin to learn the strengths and weaknesses of this 
language. 

The DAQUES system consists of two major parts, the first 
being the database abstract generation portion and the 
second the actual query system. These subsystems are almost 
completely independent of each ether, their only link being 
that the output of the generation subsystem is the input to 
the query subsystem and the file 'attrlist* which is 
consulted by both subsystems. The subsystems will be 
described in the following two sections. 

B. THE DATABASE ABSTBACT GENERATION SUBSYSTEH 

1 • General Ch ar a cteristic s of the Sub syste m 

The database abstract generation subsystem is not 
intended to be used by the average database user with only 
query privileges. Use of this system will require a rela- 
tively sophisticated knowledge of the meaning of the term, 
database abstract, and of the the merits of the various 
tradeoffs which must be made when constructing the database 
abstract. For this reason, it is recommended that the use 
of the database abstract generation subsystem of DAQUES 
require the user to hold the database administrator privi- 
lege or its equivalent. 

The database abstract generation subsystem is 
constructed to be as general as possible so that it can be 
applied to an arbitrary database. However, there are a 
minimal number of facts in the system which are specific to 
the example medical database. These facts are located at 
the end of the ’attrlist' file and provide a list of the 
attibute names for the database, a designation of the attri- 
butes which are non-numeric and their values in the 
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database. These data items would be the only portion of the 
entire system which would have to be altered to port the 
system to another database. 

2* Th e User I nterface 

The most visible portion of the database abstract 
generation subsystem is the user interface section which 
prompts the database administrator for all required input 
information needed to construct the desired database 
abstract and converts that information into a form which the 
rest of the program can understand. The information needed 
is the name of the database to be processed, the level of 
knowledge for the particular database abstract to be 
constructed (could be 'all' if the administrator desires to 
construct four levels at once) , the number of partitions 
that each attribute within the database is to be divided 
into and the name of the output file or files in which to 
store the database abstract. The name of the database file 
may seem to be unnecessary information, after considering 
that several rules in the system are specific to a partic- 
ular database. With the current system it is true that the 
database name is unnecessary. However, it is possible to 
process one or more databases with no attribute names in 
common and therefore include all necessary rules for all of 
the databases. The rules would not interfere with each 
other because the clause heads of the rules for the inappli- 
cable database would not match, given the distinct attribute 
names. Therefore, to provide added flexibility, this infor- 
mation is included as part of the input from the database 
administrator. 

The level of knowledge to be input must be one of 
the five discussed in chapter cne or the user may specify 
'all' levels. The 'all* option is a bit of a misnomer 
because it calculates only levels two through five. This 
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distinction is because of the fundamental difference between 
the calculations for level one and those for levels two 
through five to be discussed later in this chapter. The 
advantage to choosing the ’all' option is that the database 
need only be processed once to produce four different levels 
of knowledge whereas, if a particular level is specified, 
the database is processed once for each level. Therefore, 
the ’all’ option can provide a significant savings in time 
and machine resources especially for a large database. If 
the user specifies the ’all* option, he must also specify a 
list of output files in which tc store the database abstract 
files for levels two through five instead of a single file 
name. 

3. Th e Partitioning Proce ss 

The most difficult and important decision which the 
database administrator must make concerns the granularity of 
the partitions for each attribute. If the attribute is 
non_numeric, there is only one acceptable number of parti- 
tions and it is input from the database specific facts in 
the ’attrlist’ file. If the attribute is numeric, the user 
must input the desired number of partitions for that attri- 
bute. The attribute list file is then checked to see if the 
partition boundaries have previously been calculated for 
this granularity. If so, the tabulated boundaries are used. 
If not, the user is asked to choose between automatically 
generated partition boundaries or specific boundaries which 
he may input himself. 

The former method for determining the partition 
boundaries uses an algorithm tc attempt to place an equal 
number of values in each partition. The input to this rule 
is the number of partitions, and a list of the values of 
that attribute together with their frequency of occurrence 
in the entire data ase. This list is equivalent to level 
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five knowledge about the entire database and must be 
produced by the database administrator prior to utilizing 
the system. This information is easily generated by running 
this same system and specifying 'one* as the number of 
partitions. The algorithm divides the number of values in 
the database by the granularity and then passes through the 
frequency list including each value until its inclusion will 
cause the number of values in the partition to exceed the 
calculated size of the partition. The actual size of the 
first partition is then subtracted from the size of the 
database. That number is divided by the number of parti- 
tions minus one and the algorithm is called recursively. 
When the database contains a large number of distinct items, 
none of which occur very frequently, (i.e. even distribu- 
tion) , this algorithm is quite good for producing partition 
boundaries resulting in partitions of relatively equal size. 
However, when an attribute has one or more values which 
occur many more times than the other values in the database, 
(i.e. the distribution is highly uneven), this algorithm 
results in partitions of very different sizes and conseq- 
uently, as will be evident in our discussion of the query 
subsystem, reduced accuracy of query results. 

In the case of the highly uneven distribution, or to 
create partition sets which are meaningful in a particular 
application field, the administrator may explicitly specify 
the boundaries of the partitions. This option is a highly 
desirable one especially in the latter circumstance. Many 
attributes fall naturally into partitions because of conven- 
tions within the application area or merely for convenience. 
As mentioned in chapter one during the introduction to the 
meaning of the database abstract partitions, the temperature 
attribute may be divided into low temperatures, normal 
temperatures and high temperatures by specifying the numeric 
convention for defining these terms. In the context of our 
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program, such user specified partition boundaries may not 
result ^ . partitions of r itively equal size, however the 
partitions themselves wil_ oe sets which are much more mean- 
ingful to the expected user of the system than the algoritm- 
ically produced partitions. For reasons that will be 
discussed in depth later in this chapter, there are also 
tremendous advantages in accuracy of the bounds calculations 
for the system user when his query set corresponds exactly 
to a union or intersection of the exact partition sets 
rather than numerically specified boundaries. 

For example, if the user is interested in the set of 
male patients with temperatures bet een 35.5 and 36.3, he 
may explicitly query this set with no knowledge of the 
partitions contained in the database abstract. If this set 
does not correspond exactly to the boundaries generated by 
the partitioning algorithm, the query preprocessor must 
determine the minimal union of actual partitions which 
covers this desired set and calculate the upper bound for 
the query based on this set. In a similar manner, to calcu- 
late a lower bound on the query, the query preprocessor 
determines the maximal union of actual partition sets which 
is completely included in this desired set and calculates 
the lower bound of the query based on this set. If, on the 
other hand, the database administrator is aware of commonly 
queried sets in the application area such as the normal 
temperature range, he may explicitly specify that the parti- 
tion boundaries correspond to these ranges, providing much 
more accurate query results for such sets. 

To extend this concept cne step further, the data- 
base administrator may specify an alias for that partition 
name such as ’normtemp’. Using this facility allows the 
user to express his query in terms of the 'normtemp’ set 
instead of the conventional name for the second partition 
set for the temperature attribute, i.e. temp (2). The 



24 



advantages of explicitly specifying partition boundaries for 
particular attributes should become more apparent after 
reading the discussion of the query subsystem of DAQUES. 

Whichever method the database administrator chooses 
to arrive at the boundaries of the partitions, these bound- 
aries are inserted into an attribute list and added to the 
'attrlist' file for use in later database abstract genera- 
tion executions and for use by the preprocessor of the query 
subsystem. Additionally a list of the attribute names and 
the number of partitions is written into the level one data- 
base abstract file and is passed on to the ’frequencies' 
file which performs the actual database abstract generation 
function. 

4 . Gener ati on of the Aerial Database Ab stra cts 

First we will discuss the level one calculation. As 
mentioned previously, there is an inherent difference 
between the method of calculaticn for the level one database 
abstract file and the remaining levels. Recall that level 
one information contains the size of each partition with 
respect to each attribute in the database. When the user 
interface rules determine that level one knowledge is 
desired by the database administrator the 'level 1’ file is 
consulted. Consequently, when the 'ex' function is called 
by the user interface, it matches the level one 'ex' rule 
and begins the calculation cf the level one database 
abstract. The first task for this calculation is the 
initialization of the totals for the size of each partition 
located in the PROLOG internal database to zero. Then each 
tuple of the database is read in from the database file. 
The 'check' function is called for each of the attributes in 
the list passed to this file by the user interface. The 
'check' rule determines the value of the given attribute in 
the current tuple. It determines the partition set into 
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which this value falls and increments the total for that 
partition set in the inte; nal database by one. After the 
entire database has been processed this manner, level one 
information is contained in the PROLOG internal database. 
The ’ continueall • clause accesses all of this information 
and writes it to the external file specified for the level 
one database abstract by the database administrator. 

When the database administrator specifies any level 
other than level one, the user interface rules consult the 
* freguencies’ file as well as the corresponding ’level’ 
file. The ’frequencies’ file produces the same information 
for levels two through five and asserts that information 
into the PEuLOG internal database for use by the appropriate 
’level’ file. The information produced by the ’frequencies’ 
file is exactly the level five database abstract information 
and from this information, levels two through four informa- 
tion can be derived. Recall that level five information is 
a list of the values that occur for each attribute in each 
of the partition sets together with the frequency with which 
those values occurred in that set. Therefore the first task 
accomplished by the ’ f reguencies’ file is to assert an empty 
list as the initial tagged frequency distribution list for 
each attribute and each partition set in the database 
abstract. Then, each tuple is read from the database file. 
The ’checkalla ttrs* rule is invoked for each attribute. 
This rule finds the value of the ’partitioning attribute’ 
from the current tuple and determines which partition set 
that tuple belongs to with respect to the ’partitioning 
attribute’. Then the ’check’ rule is invoked for each 
numeric attribute. This rule finds the value of the ’check 
attribute’ and accesses the frequency list for the partition 
set of the ’partitioning attribute’ with respect to the 
’check attribute’. The ’intolist’ rule searches the list 
for the value of the ’check attribute’ and increments the 
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frequency corresponding to that value by one if the value is 
found. If the value is not found in the list, the value is 
inserted into the list with a frequency of one, maintaining 
the ascending order of values (not frequencies) within the 
list. This process is repeated with each numeric attribute 
substituted as the ’check attribute’. Then the ’checkal- 
lattrs’ rule is called with the next attribute substituted 
as the ’partitioning attribute’. 

5. Proces s ing an Exam ple Tuple 

As an example of this process, recall the medical 
database with its six attributes; patient number (patno) , 
sex, disease activity (da), temperature (temp), cholesterol 
level (chol) , and prednisone level (pred) . The values of 
these tuples are arranged in that order within the input 
tuples so that an input tuple would appear as follows: 

(10, female, 12,36.2,230, 55) 

The ’ checkallattrs’ rule would first be invoked with the 
patno attribute as the ’partitioning attribute’. The value 
of 10 for patno would place this tuple in the patno (1) 
partition, for example. Then the check rule would be 
invoked with the patno attribute as the ’check attribute’ 
and the following list would be searched for the value 10: 

data (fregdist , patno (1) , patno, [[4,1 ],[7,2],[ 10, 3], [ 15,6]]) 

In this case, 10 would be found and its frequency, 3, would 
be changed to 4. Then the check rule would be invoked with 
da (the next numeric attribute) as the ’check attribute' and 
the following list would be searched for the disease 
activity value of 12: 

data (fregdist, patno (1) ,da,[[5,3],[ 14, 4], [24, 5]]) 
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In this case the value of 12 wculd not be found so the two 
element list [12,1] would be inserted into the list to 
produce the following new list: 

data (f regdis t, pat no (1) ,da,[[ 5,3 ],[ 12, 1 ],[ 14,4 ]#[24,5]j) 

In a similar manner the ’check* rule would be invoked with 
the attributes temp, chol and pred as the ’check attri- 
butes’. Then with the patno attribute execution of the 
’ checkallattrs’ rule completed, this rule is invoked with 
the sex attribute as the ’partitioning attribute’. The 
partition set ’female’ is determined to be the one to which 
this tuple belongs with respect to the sex attribute and the 
’check’ rule is invoked with each niaeric attribute as the 
’check attribute’. In like manner, the ’checkallattrs’ rule 
is invoked with the da, temp, chol and pred attributes as 
the ’partitioning attribute’ tc complete the processing of 
this one tuple. Then the next tuple is read in and the 
process repeated. 

When the entire database has been read and 
processed, the internal PROLOG database contains tagged 
frequency distribution lists for every partition and each 
attribute. k good way to visualize this info ition is to 
recall that the partitions or equivalence cl ;es of our 
database abstract are the primitive data items jf the ’quo- 
tient database’. We are attempting to capture as much of 
the information content of the tuples of the original data- 
base as possible in our database abstract. We cannot tabu- 
late a value for each of the attributes for the partitions 
because the partitions are made of a collection of tuples of 
the original database. Instead we chose in level five to 
substitute this tagged frequency list for each attribute as 
the ’abstract value’ of that attribute. In terms of our 
example, instead of tabulating a da value for the patno (1) 
partition we tabulate the list cf the da values from all the 
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tuples which fall into the patno(l) partition and the 
frequency of occurrence of each value in that partition. 

Now that all this information has been asserted into 
the PROLOG internal database, the 'frequency* file invokes 
the 'continue' rule. The content of this rule will depend 
on which 'level' file was consulted by the user interface 
after inputting the desired level of knowledge from the 
database administrator. If the desired level is five then 
the 'continue' rule merely writes all of the tagged 
frequency distribution lists to the external file designated 
by the database administrator to receive the level five 
database abstract. If level four is chosen, the 'continue* 
rule compacts the tagged frequency lists removing the values 
of the attributes and leaving only the frequencies. These 
frequencies are then sorted in descending order and written 
to the file specified for the level four database abstract. 
If level three is designated, the tagged frequency list is 
sorted in descending order of the frequencies and the second 
and middle elements of the list are extracted, corresponding 
to the frequencies of the second most frequent and the 
median frequency value of the check attribute on that parti- 
tion. These values are written in an appropriate format to 
the file designated for the level three database abstract. 
Finally, if level two is specified, the tagged frequency 
lists are again sorted in descending order of frequencies 
but the first frequency in each list and the length of the 
list is written to the appropriate file, representing the 
frequency of the most frequent value and the number of 
distinct items in the partition. 

It should now be obvious to the reader that the 
calculation of the database abstracts for levels two through 
five involves a common initial calculation. When the data- 
base administrator specifies 'all* as the level of knowledge 
for the database abstract calculation, the 'alllevels' file 
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is consulted in place of a specific ’level' file. The ’con- 
tinue' rule in this file invokes four individual rules 
called ’continue2’, ’continue!', 'continueU', and 'con- 
tinues', all identical to the 'continue' rules contained in 
their respective 'level' files. In this way the 'continue' 
rule for the 'alllevels' file processes the tagged frequency 
distribution lists located in the PROLOG internal database 
producing all four desired database abstract files. 

Now that the database abstract generation subsystem 
of the DAQUES system has been described in detail, we will 
continue our discussion in the following section with a 
description of how these database abstract files are used by 
the query subsystem to produce bounds and estimates for the 
user's queries. 

C. THE QUERY SUBSYSTEM. 

1 . Structure of the Subsy s tem 

After the database administrator has completed the 
construction of the database abstract files, the second 
subsystem of the DAQUES system should be ready to be used 
with no modifications. This subsystem was designed to input 
all database specific information, such as rules concerning 
attribute names, frcm the database abstract files and the 
'attrlist' file. Therefore, the query subsystem should be 
able to query multiple database abstracts independent of 
their particular partitions or even the database from which 
they were constructed, provided that the database abstract 
was built entirely with the generation subsystem of the 
DAQUES system. Of course, this system is designed for use 
by the standard user with only query privileges. 

The query subsystem has three major program parts, 
the user interface and help feature, the query preprocessor 
and the actual calculation portion of the program. With 
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regard to file organization, the rules are divided into five 
files. The ’master’ file contains the user interface and 
all preprocessing functions as well as many rules too 
general to be included in one of the specific files (e.g. 
the ’member’ rule). The ’info’ file contains the rules for 
the help feature of the user interface. It is currently a 
rather limited feature but was written with future expansion 
in mind. The ’sups’ file contains all the rules for calcu- 
lating the upper bound on the size of a query. It has its 
rules divided into those for unions and those for intersec- 
tions and then subdivided into rules for levels one through 
five within the union and intersection divisions. The 
’infs’ file contains the rules for calculating the lower 
bound on the size of the queried set. Its organization is 
identical to the ’sups’ file. Finally, the ’ests’ file has 
all the rules for calculating an estimate for the size of 
the input set. It has only distinct sections for union and 
intersection rules. 

Within the ’sups’, ’infs’, and ’ests’ files the 
basic program structure is quite similar. The ’dosup’, 
’doinf’, and ’doest’ rules each input a statistic, a set, 
and an attribute. The statistic will always be ’size’ in 
the current system implementation, however, the system is 
designed so that additional rules might be added at a later 
time to extend the capabilities of the basic system. Also 
for this reason the attribute is input to these rules to 
allow future queries such as the mean value of temperature 
for the input intersection or union set. With this exten- 
sible format, a general purpose user interface and prepro- 
cessor could invoke the appropriate ’do’ rule merely by 
passing the proper arguments to the rule. 

When the query sets are input to the ’do’ rules, 
they are in prefix form or in PROLOG terminology, they 
consist of a single argument functor, where the functor name 
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must te either ’and' cr ’or’. The single argument to these 
functors contains a list of the sets in the intersection or 
union and as such can be of arbitrary length. Additionally 
any individual element of the list can be one of these 
single argument functors indicating a lower level union or 
intersection. Only time constraints limit the size of the 
intersection or union list or the nesting level of unions 
and intersections within unions or intersections. Because 
of this PROLOG list structure embedded within a functor, the 
system becomes very general purpose and handles a very wide 
assortment of input guery sets. Also such questions as the 
optimum normal form of an input query become much less 
important than they are in systems with a much more struc- 
tured input set. In £Eef. 1], for example, the guery system 
inputs sets in prefix form with two arguments only. 
Therefore, in order to guery a set such as 
and (lowtemp,lowchol, male) , that system had to decide which 
of the three possible formats, and (and (lowtemp, lowchol) , 
male), and (lowtemp, and (lowchol, male)) or and (and (low temp, 
male) , lowchol) , produced the best guery results. (In his 
system, Rowe decided to process all three guery forms and 
use the most accurate results.) Such questions do not apply 
in a list oriented input format system. 

The list format for the input sets does require a 
very different processing strategy, especially when the next 
set in the input list might very well be a very large union 
or intersection of other sets. Tests such as disjoint and 
subset rules become much more complicated because they must 
be more general purpose than systems with a fixed number of 
arguments. The general processing strategy for this format 
involves examining each set in the argument list. The set 
is checked to see if it is a ’simple* set in the sense that 
it has size information stored about it in the PROLOG 
internal database or in the system cache. If it does, the 
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program retrieves this information and moves on to the next 
set. If the current set has nc data either in the internal 
database or in the system cache, then the set is not 'sim- 
ple' and the appropriate 'do' rule is invoked for this set. 
When this 'do' rule returns from execution, the set will 
have been converted to a 'simple' set in that it will have 
all the same information stored about it in the system cache 
that an actual partition set has stored in the internal 
database at the time program execution started. Another 
important point is that if this same union or intersection 
occurs at another point in the same query or in a subsequent 
query within the same session, the union or intersection set 
still appears as a 'simple' set to the system. A more thor- 
ough discussion of the details of this concept will be 
presented following the presentation of the user interface 
and preprocessing sections of the query subsystem. 

2* The Oser Inte rface and Help Fe at ures 

Because this system is intended to be used by the 
novice user, it includes a very friendly user interface and 
an expandable help facility. The user interface must input 
several items of information from the user and using this 
information, must control the consulting of the proper files 
to ensure the results desired by the user. The user inter- 
face first asks the user what level of knowledge he wishes 
to use for his query and gives the user the choice of a 
normal, alternate, or special database abstract file. (It 
is intended that database abstracts with several different 
granularities of partitions or aliases will be available to 
the user depending on his accxiracy needs or the partition 
conventions which he commonly uses.) If the user does not 
understand the different levels of knowledge or does not 
remember which database abstract files are currently associ- 
ated with the terms normal, alternate and special, he may 
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simply respond by typing 'help'. This will produce an 
explanation of the current database abstracts available as 
well as a discussion of the differences in levels one though 
five knowledge. After the user has entered this informa- 
tion, the user interface must consult the appropriate data- 
base abstract files as well as the alias file and attribute 
list files discussed in the section on the database abstract 
generation system. The user interface then stores knowledge 
of the current level that the user is utilizing so that 
during the next query, no additional files need be consulted 
unless the user wants to use a new level of knowledge or a 
new database abstract file. 

The user interface next inputs the statistic to be 
queried which currently must be the 'size' statistic. If 
the user inputs a different statistic, the help feature will 
provide an explanation that only the 'size' statistic is 
currently operational on this system. It is anticipated 
that when other statistics such as mean are provided in the 
system, the user interface will next ask the attribute that 
the statistic is to be queried over. Because the size of a 
set is not associated with any particular attribute, the 
system currently omits this question. 

The user interface next inputs the set to be queried 
in infix form. Again failure to input a proper set will 
result in a call to the help facility and an explanation of 
the proper format for inputting sets. It was decided to 
input the query set in infix form after several weeks of 
testing the program using the prefix form as input for the 
sets. The prefix, list oriented set notation is quite 
useful for providing generality to the program, however, 
requiring the user to happen upon the proper combination of 
parentheses and square brackets needed to input the desired 
set completely counteracted all other attempts made to make 
this interface user friendly. As an example, consider the 
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query wishing to know the size of the set of all males in 
either the first or second cholesterol partitions. The 
prefix form of this sets is the following: 

and ([ male, or ([ chol (1) , chol (2) ]) ]) 

Compare the above notation with the infix notation shown 
below ; 



male* (chol(1) + chol(2)) 

It should be obvious how this advantage in simplicity 
becomes increasingly more important as query sets become 
more complicated. Therefore it was decided that the small 
degradation in system performance needed to implement an 
infix to prefix preprocessor was justified. This prepro- 
cessing feature is described in the following subsection. 

3 . The Pre proc e sso r. 

One need for a set format preprocessor was intro- 
duced in the previous section. This subsection will demon- 
strate two additional needs for preprocessing and describe 
the specific performance of the rules to implement these 
func tions. 

During the first subsection above the general nature 
of the input set was discussed in some detail. One of the 
disadvantages of this format is the capability it provides 
to the user to input sets which execute much less effi- 
ciently than a different expression of the same set would 
execute. For example, the set: 

chol{1) ♦ chol (2) *■ chol (3) 
translates into the prefix form: 

or ([chol (1) , chol {2) , chol (3) ]) 
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which executes much more efficiently and accurately than the 
query; 

{chol (1) + chol(2)) + chol(3) 

which translates into prefix as follows: 

or ([or (£ chol (1) , chol (2) ]) ,chol (3) ]) 

The two sets are logically identical and should result in 
the same upper and lower bounds and estimates from the 
system. However, they would not without the second function 
of the preprocessor. This function is descriptively called 
the 'squash' function because of its capability to transform 
a multilevel set into a logically equivalent single level 
set which will produce quicker and more accurate results. 
(For a thorough discussion of the 'optimal form' of a query, 
refer to [Eef. 2] ). This function does not just provide 

protection against the novice user who inserts extra paren- 
theses where they are not needed. The output of the third 
feature of the preprocessor, described below, is often in a 
form which needs to be 'squashed' to execute efficiently and 
accurately. 

During the discussion of the alias i ire of the 
generation subsystem, it was mentioned that, in aneral, the 
input sets to the system would be in the form of unions or 
intersections of sets defined by a range of values of a 
particular attribute. So far we have only discussed sets 
which are exactly one partition set of a particular attri- 
bute. It was also mentioned that such exact partition sets, 
especially when they are aliased to a descriptive name, are 
quite useful when a particular attribute was partitioned 
during the database abstract generation, to correspond 
ex :tly to a set which is of some importance in the applica- 
tic field and therefore -ould commonly be a part o various 
queries. However, when the atove conditions do not apply. 
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it is appropriate to allow the user to express the guery set 
in terms of a range of values ever a particular attribute. 
The guery subsystem allows the user to express his guery set 
in either of the formats as well as in terms of some alias 
name for a particular partition set. The third feature of 
the preprocessor provides the conversion of these forms to a 
common form to pass on to the calculation section of this 
subsystem. 

To provide a better understanding of the three 
stages of the preprocessor, we will follow the execution of 
a particular guery set from its entry into the system until 
it is passed to the calculation system for actual 
processing. Consider the following guery: 

male * chol (1 15 , 280) 

which means that the user wishes to know the size of the set 
of all male patients with cholesterol levels from 115 to 280 
inclusive. The user interface passes this infix form to the 
’convert’ rule which matches the ’*’ symbol producing an 
’and’ functor with arguments of whatever is returned from 
the recursive calls to ’convert’ made for the left and right 
sides of the symbol. Had either of these sets been a 
lower level (parenthesized) union or intersection, then one 
of the elements of the functor’s argument list would have 
been another functor. In fact, for infix expression without 
parentheses that are longer than two elements, the ’convert’ 
rule uses left precedence rules and operates as if there 
were parentheses around the righthand side of the expression 
producing an output prefix expression of depth one less than 
number of terms in the the input expression. This limita- 
tion is solved by the ’sguash’ function. 

When the ’convert’ rule is invoked for a simple set 
like ’male’, it recognizes this set as a simple set and 
merely returns the set name. When it is invoked with an 
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alias name, it checks the alias list and returns the name of 
the partition set to which the alias refers. If the rule is 
invoked with a range expression such as chol ( 1 15, 280) , it 
recognizes the two arguments to chol as lower and upper 
limits to a range and calls the ’cover’ and ’iscovered’ 
rules. The ’cover’ rule uses the upper and lower limits and 
the partition boundaries stored in the ’attributelist’ facts 
to find the minimal union of actual partition sets for the 
given attribute which covers the desired range. It returns 
this union as the set to be used by the calculation system 
when the upper bound or sup of the statistic is being deter- 
mined. Then the ’iscovered’ rule uses the same information 
to find the maximal union of actual partition sets for that 
attribute which is covered by the given range. This union 
is used by the ’infs’ file when calculating the lower bound 
on the size of the set. Suppose the ’cover’ and ’iscovered’ 
rules return or([chol(l), chol (2) ]) and or (£chol (1) ]) as the 
two unions discussed above. Then the two sets to be passed 
on to the ’squash’ function for the next stage of prepro- 
cessing would be and ([male, or([chol(1), chol (2) ]) ]) and 
and ([male, or ([ chol (1) ]) ]) . When ’squash’ is called with 
the first set description it returns the same form because 
there is no ’squashing’ possible on this set. Consider, 
how /er, if the input set had been a union instead of an 
intersection. Then the input cf the squash function would 
be or ([male, or([chol(1), chol (2) ]) ]) which would be 
’squashed’ into the more accurate and efficient form 
or ([ male, chol(1), chol (2) ]) . For the second set the 
’squash’ function recognizes the single element union which 
results from the generality of the ’iscovered’ rule. This 
set will therefore be ’squashed’ into and ([male, chol(1) ]) , 
a form which will provide much more accurate and timely 
results. 
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Now that we have described how the user's guery is 
preprocessed from its various possible input formats into a 
common form, in the following subsections we will describe 
how the actual calculation rules are implemented to produce 
the guery results. 

h. Testing fo r Pis join t Sets aiid Su bsets 

The detection of sets in the input union or inter- 
section which are disjoint or which are subsets of each 
other can greatly increase the accuracy of the guery 
results. For this reason, the calculation rules contain 
complex tests for these conditions. The disjoint test 
primarily involves a search for sets which represent two 
different partitions of the same attribute which, by defini- 
tion, are disjoint. In addition, the disjoint test involves 
a complex series of tests to determine whether a union or 
intersection of sets is disjoint from another union, inter- 
section or simple set. The rules which are implemented in 
this section are discussed in detail in [Ref. 1 ]. 

The advantage of determining that an intersection 
contains two or more sets which are disjoint is that all 
further calculations may be skipped because the size of that 
intersection is most certainly zero. However, the advantage 
of determining that two or more sets in a union are disjoint 
is not so straightforward. If all the sets in the union 
were found to be disjoint from all others in the union, then 
we could arrive at the size of the set by merely adding all 
the sizes of the component sets. However if only two sets 
in a very large union list are determined to be disjoint, 
what advantage have we gained? We could consider this 
disjoint union as a simple set with a definite size egual to 
the sum of the sizes of its two components. There are two 
problems with this approach. The first is the question of 
what to do about the non transitivity of dis jointness. For 
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example, if set A is disjoint from set B and set B is 
disjoint from set C, then we cannot conclude that set A is 
disjoint from set C. Therefore how do we decide which of 
the two pairs of disjoint sets to combine into a single set? 
The second problem relates to how to convert this disjoint 
union into a true simple set in the sense that it has all 
the information tabulated about it in the cache that an 
actual partition set would have tabulated about it in the 
PROLOG internal database when program execution started. If 
the desired level of knowledge is one, there is no problem 
beca _se the size of the set is level one knowledge. 
However, if the desired level is five, the leve' 'ive infor- 
mation about this disjoint union would have to calculated 
and added to the cache. Because of these difrxcult ques- 
tions, it was decided not to test for disjointness during 
the union calculations. The level one upper bound on the 
size of a union assumes the worst case, that the component 
sets are disjoint, in arriving at the upper bound. So even 
foe this single level where the majority of the problems 
discussed above do not apply, the disjoint test would not 
demonstrate any possible advantage. 

It would be an interesting extension to this s stem 
to include a disjointness test in the union calculation, to 
develop rules to choose between co.cflicting pairs of 
disjoint unions and to write special rules for calculating 
the level two through five information about the disjoint 
union given only the level two through five information 
about the component sets. This extension to the basic 
system has great potential for improving system accuracy 
because of the prevalence of disjoint unions as input sets 
resulting from the preprocessor translation from range 
values into unions of partition sets as discussed in the 
previous se' ion. 
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The subset test cannct generally arrive at a 

specific answer for the size of an input set as the disjoint 
test can. It can, however, shorten the union or intersec- 
tion list for which the bounds and estimate calculations 
must be made. The 'testallu' and 'testalli' rules perform 
this subset test for the union and intersection rules 
respectively. The 'testallu' rule inputs the union list and 
determines whether any set in the list is a subset of any 
other. If so, the subset is removed from the output list 

because such a subset could not possibly contribute any 
tuples to the union which have not already been contributed 
by the set which contains it. In a similar manner for 
intersections , the 'testalli' rule discards the supersets 
because they cannot possibly contribute any tuples other 
than those contributed by the subset. The specific subset 
rules implemented in this system are discussed in some 
detail in [Ref- 1]. 

The disjoint and subset rules might be more appro- 
priately included in the preprocessor section of the query 
subsystem because they do actually alter the input to the 
actual calculation portion of the system in the case of the 
subset rules or abort the calculation in the case of the 
disjoint rules. It was decided to have these rules as part 
of the calculation section instead because they deal with 
the content of the sets in the input unions and intersec- 
tions. The preprocessor is mostly concerned with arriving 
at an optimum format with which to enter the input set and 
not so much with the meaning of the sets involved. 

D. IHPLEHENTATION OF ROLES FOE THE VARIODS LEVELS 

The actual calculation portion of the query subsystem 
receives three pieces of information from the preprocessor. 
These are the statistic being queried, the set being queried 
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in prefix notation, and the level of knowledge desired by 
the user. This information is contained in the three of the 
airguments of the *dosup', *doinf’, or ’doest* rules called 
by the preprocessor. Depending on the value of these argu- 
ments any of several different rules could be invoked to 
calculate the appropriate bounds and estimates. The current 
system contains rules for answering queries on the size 
statistic only. (Chapter IV contains a discussion on how to 
extend the system to ether statistics.) The prefix form of 
the set will have either an ’and* functor or an ’or’ functor 
and will match a different set of rules in the ’sups’, 
’infs’ and ’ests’ files depending on the value of this 
functor in the set notation. Finally, the level of knowl- 
edge specified by the user will cause only the appropriate 
calculations for the desired level to be executed. 

As an example of this rule matching process, we will 
continue the example that was started in the previous 
section. This section concluded that the preprocessor had 
converted the initial input set into 

and {[ male, or ([ chol ( 1 ) ,chol (2) ]) ') for the upper bound calcu- 
lation and and ([ male ,chol (1) ]) for the lower bound calcula- 
tion. Suppose also that the user had desired that level 
five be used to calculate the size of this set. In this 
case, the preprocessor would Ccill the ’sup’ portion of the 
query subsystem with the goal: 

do sup (size, and ([ male, or ([ chol (1) , chol (2) ]} ]),_,Sup,5). 

and the *inf’ portion of the subsystem with the goal: 

doinf (size , and ([ male, chol ( 1) ]) ,_,Inf,5) . 

The PEOLOG system attempts to match these goals with the 
clause heads of all the rules in the ’sups’ and ’infs’ files 
respectively until a rule with an appropriate clause head is 
found. The only rule in each of these files which will 
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match will be the appropriate general upper and lower bound 
rules for calculating bounds on intersection sets thus 
further restricting the possible rules which could be 
applied. Finally, the general 'dosup* and *doinf* rules 
call each of the 'dosupl' through 'dosupS' rules and 
’doinfl* through 'doinfS* rules passing to these rules the 
set to be queried and the level of knowledge desired by the 
user . 

The level one calculations are unique from the remaining 
levels in that they are always executed regardless of the 
specified level. Level one calculations are always executed 
because they are a part of the fundamental structure of the 
query subsystem. The level one 'dosup' rule calls the 
'supall' clause passing to it the list of the component sets 
in the union or intersection. The 'supall' clause tests 
each component set to determine if that set has data stored 
about its size in the internal PROLOG database or in the 
system cache. If the set happens to be an actual partition 
set of the database abstract as is the case with the set 
'male' above, the level one database abstract will have the 
actual size of the set recorded in the internal database. 
This size will merely be returned as an upper bound on the 
size of this component set. If, however, the set has no 
information stored about it in the internal database, as in 
the case of or ([ chol (1) ,choi (2) ]) in the example above, the 
'supall' rule recursively calls the 'dosup' rule with this 
component set as the argument. 

The purpose of this call to the 'dosup' rule is not only 
to return an upper bound on the size of the component set, 
but to convert that set into a true 'simple set' with 
respect to level five information. The term 'simple set' 
means that the set should have all information stored about 
it in the system cache that an actual partition set would 
have stored about it when system execution begins. In the 
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case of our example, the set or ([ chol (1 ) ,chol (2) ]) would 
have to have an upper bound on its size and an upper bound 
on its level five tagged frequency distribution list 
asserted into the system cache to fulfill these require- 
ments. The lower level call to the *dosup' rule executes 
exactly the same as the upper level call to that rule except 
that when the supall clause tests each of the component sets 
of this union, they are already simple sets. Therefore, the 
list of the tabulated sizes is returned and the level one 
execution continues. 

For each of the remaining level , a rule exists which 
says, for example, that the corresponding dosupx goal 
succeeds if the des. ?d level is not equal to x. In this 
case only the simple test for level five fails and back- 
tracking must occur to satisfy the 'dosupS' goal. The 
second ’dosupS’ rule contains the calculation needed to 
obtain the appropriate upper bound. The two 'dosup5' rules 
are shown in figure 2.1. In these rules L is the intersec- 
tion set list and I is the level of knowledge requested by 
the user. The goal established by the 'dosup* rule is the 
following: 

do sup (size,and([ male, or ([ chol(1) ,chol (2) ] ]) , _, Val, 5) . 

The PROLOG system will me 'h this goal to the first rule in 
figure 2.1 and instantiate the variable L to the set list 
and the variable I to 5. The first subgoal which this rule 
will then attempt to prove will be not (5=5) which of course 
will fail, causing the system to backtrack to the next 
•dosup5’ rule. This rule instantiates the same two vari- 
ables but the first subgoal, 5=5, succeeds this time and the 
second subgoal is then established. The 

'data (attributes, Alist) ' subgoal retrieves a list of the 
attribute names for the given database instantiated to the 
variable Alist. This list is provided in the 'attrlist' 
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file which contains all the database specific information 
needed by the system and which was consulted at the begin- 
ning of execution. The next subgoal set up by the system is 
the ’map’ subgoal which causes the 'sumfregs' rule to be 
executed with each element of the Alist as its first argu- 
ment. The rules for the 'map' functional and for the 'sum- 
fregs* clause and its subordinate clauses are shown in 
figure 2.1. 

The 'sumfreg' goal is first set up with patno as the 
value of the Attr variable and [ male, or ([ chol ( 1) , chol (2) ]) ] 
as the value of the variable L. The 'sumfregs' goal first 
establishes the subgoal 'getfregsS' which returns a list of 
the two level five freguency distribution lists for the 
patno attribute for the two sets in the list. Note that the 
success of this clause is only possible because the lower 
level call to the 'dosup' rule made in the level one calcu- 
lation with or ([ chcl (1 ) ,chol ( 2) ]) as the argument set 
converted this set to a simple set with respect to level 
five. In other words this lower level call asserted an 
upper bound on a level five tagged freguency distribution 
list for this union for all the database attributes. 

Next, the system passes this list of freguency distribu- 
tions cn to the 'findsumS' rule. This rule is the heart of 
the level five calculation. It is a recursive rule whose 
basis is the condition that the first freguency distribution 
in the list of freguency distributions has been reduced to 
the empty list. On each iteration, the 'findsumS' rule 
considers the first item in the first freguency distribution 
list and calls the ' minf regofitem ' rule to return the 
minimum freguency with which this item occurs in any of the 
freguency distributions of the list. This minimum freguency 
is then added onto a running sum and it is also inserted 
into a new list which will eventually represent the upper 
bound on the tagged freguency distribution of the input 
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intersection set. Then the ’findsumS’ rule is called recur- 
sively with the remainder of the first distribution substi- 
tuted for the original distribution. When the basis condi- 
tion for this recursion is satisfied, the running sum will 
represent an upper bound on the size of the intersection 
because only items which occur in the first set of the 
intersection can occur in the intersection set. Those that 
do occur in the first set can cccur in the intersection set 
with a frequency no greater than the frequency with which 
they occur in that component set in which they occur the 
least frequently. Therefore the sum of these minimum 
frequencies represents an upper bound on the size of the 
intersection set. Note also that the new list representing 
the upper bound on the level five distribution of the inter- 
section is asserted into the system cache to convert this 
intersection into a simple set with respect to level five. 
In the case of this query this conversion is not really 
necessary because the intersection set is the higherst level 
set for this query. However, the conversion is completed to 
maitain the generality of the ’dosup’ rules. (An alternate 
approach to this problem is presented in chapter IV.) 

The concept of a union or intersection being converted 
into a simple set is fundamental to the execution of this 
system. When the upper level 'supall' rule returns from its 
execution, not only will it return a list of upper bounds on 
the sizes of the component sets in the input set list, but 
it must have caused the additional side effect of tabulating 
the appropriate level information for each of the component 
sets into the system cache. In this way, when the ’dosupS’ 
rule is called with the input set list, the system is 
assured that level five information is in fact tabulated for 
every set in the input list. 

The above discussion applies regardless of the level of 
knowledge specified by the user. The level x dosupx rule 
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dosupS (size,and (L) Val, I) not (1=5) . 
dosupS (size, and (L) , -Val,I) 1 = 5, 

data(atrrxDUtes,Alist) , 
map {suinfreqs,Alist , L ,Vlist), 
minI{Vlist>al) . 

map[punctor^Pirg1 | Arglist ],Otherargs, [ Val 1 Others ]) :-] 

maJtelis t (Functor, Arg 1, Otherargs, Val, Function) , 
X =.. Function, cal It X) , 
map (Functor, Arglist, Otherargs, Others) . 
maJcelist (F, Ai , Otherargs, val, Besultlist) 

append {[ FI [ A 1 1 Otherargs ] ],[ Val ] ,Resul tlist) . 

sumfregs(Attr,L, Val) getf regs5 (L,sup,Fl, Attr) , 

f indsumS (FI, Val ,A1) , 

asserta (cache(fregdist5,and (L) , Attr , Al, sup) ) . 



getfregsS ff f^lfst?Sest 1,E,[ FI BlJ,Attr) 

isdata (freqaist., First- E,F,Attr) , 
getfregsS (Rest, E, Fl, Attr) . 

f indsumS ([ [ ] I Others], 0,[ ]) • 

f indsumS ([ [ ITlni/ Freg j | Rest ]j Others J,S,rr Im,Min]l Al]) 
minfregofitem (Im, Freg, Min, Others) , 
f indsumS ([ Rest] oth ers j , Old , Al) , 

S is Old + Min. 



minfregofitem 



minfregofitem Item, 0,0, ) 
minfregofitem’’"-- ----- 



Item,Sofar ,Sof ar,[ ]) 



.Item,Sofar , Min ,[ First I Rest ]) 
gmember (item, First , New) , 
min (Sof ar,New, Sof ar2) , 
minfregofitem (Item ,Sofar2, Min , Rest) 



Figure 2. 1 PBOIOG Code for the Level Five Opper 
Bound Calculation for Intersection Sets. 



will always return an upper bound for the input set and will 
always cause the side effect of asserting into the system 
cache the level one and the level x information about the 
input set thus converting that set into a simple set with 
respect to level x. An analogous construction exists for 
the doinf rules. The general ’doinf rule calls 'doinfl' 
through 'doinf S' with only the level one and the specified 
level calculations being executed. The level one 'doinf 
rule calls the 'infall' rule which returns a lower bound on 
the size of all the component sets and causes the side 
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effect of tabulating jwer bounds on the level 1 and thespe- 
cified level inform ion about the input set. 

We have thus far avoided discussing the calculation of 
the estimate for the size of the input set. Recall that the 
minimal cover set used to calculate the upper bound on the 
example set was different from the maximal union covered by 
the queried range used to calculate the lower bound. In 
order to obtain a truly meaningful estimate as to the size 
of this set, we must calculate an estimate for both of the 
above sets and average the results. For this reason, the 
•doest’ rule is called twice, once with each of the two 
forms of the input set used as its argument. The execution 
of the ’doest* rule is quite similar to the ’dosup’ and 
•doinf rules except that only level one estimates are 
calculated. The 'doest* rule calls the 'estall* rule for 
the input set list which returns an estimate as to the size 
of each of the sets in the list. The 'estall* rule either 
returns the tabulated size of each component set or calls 
the 'doest* rule recursively for sets which have no tabu- 
lated size estimates available. Of course, when each of 
these lower level 'doest* calls returns, the argument set 
has an estimate as to its size recorded in the system cache. 
Because the estimate rules were the same for all levels of 
knowledge, these calculations did not produce ve_y inter- 
esting results. For that reason, the estimates are not 
discussed at all in chapter III. 

So far we have only discussed with any measure of detail 
the implementation of the rules for level five upper bound 
calculations. The general structure of the calculation for 
the remaining bounds calculations is similar, however, there 
are some significant differences. First of all, there are 
no rules to calculate new upper bounds for levels two 
thr. agh four ur, 'n sets and new lower bounds for levels two 
through four in rsection sets. Therefore, the 'do* rules 
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for these levels merely use the already calculated level one 
bounds and cause the side effect of converting the input set 
to a simple set with respect to the given level. Regardless 
of whether they calculate a new bound on the size of the 
input set all of the level calculations above level one 
involve a call to the 'map* functional to execute a given 
calculation for each attribute in the attibutelist . The 
calculations are also guite similar. Each one involves a 
retrieval rule which makes a list or lists of the informa- 
tion in the database abstract corresponding to the input 
sets. Then a recursive function passes through this infor- 
mation list building a bound value for the size of the set 
and a bound for the appropriate level information corre- 
sponding to the set. (If the calculation corresponds to one 
of the levels where no new bounds are possible, this recur- 
sive function calculates only a bound on the appropriate 
level information.) This level information is then asserted 
into the system cache and the calculation is repeated with 
the next attribute. When the map functioanl returns its 
list of answers, the sup rules choose the minimum from the 
list as the least upper bound while the inf rules choose the 
maximum from the list as the greatest lower bound. For a 
thorough discussion of the equations used in these rules, 
refer to [Ref- 2]. 

E. SUMHAEY 

This chapter has discussed in detail the iinplementation 
of the Database Abstract Query Estimation System, including 
its two subsystems, the database abstract generation 
subsystem and the query estimation subsystem. The following 
chapter will present the results of various test cases 
executed on this prototype system and discuss the concept of 
developing the optimum database abstract. 
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III. AHALYSIS OF SYSTEM TESTS 
A. TESTING STRATEGY 

System testing of the DAQUES system was not conducted in 
a completely rigorous manner, however, sufficient test cases 
were chosen to demonstrate the strengths and weaknesses of 
the system and to show how the system can be customized to 
suit the requirements of different users. Three different 
factors were varied during the tests to demonstrate several 
characteristics of the system. These factors were the level 
of knowledge of the database abstract, the granularity of 
the database abstract, and the use of partition aliases to 
construct the database abstract. The ten queries listed in 
table I were run on each of the test database abstracts to 
determine the effects of varying the above factors. 
Inspection of table I will show that the queries used repre- 
sent a good variety of different sets that the user may 
query. Of the ten sets in the table, five are intersections 
and five are unions. Some of the sets include a large 
percentage of the total database while others represent only 
very small sets. A union of three sets was chosen as one of 
the ten sets to demonstrate the system's ability to handle 
larger unions or intersections. Each of the database attri- 
butes was used several times in the ten queries to provide a 
more even mix of test cases. 

Three characteristics of the system execution were 
compared during the tests in order to arrive at useful 
conclusions. The first of these characterist ics was tight- 
ness of the upper and lower bounds calculated by the system. 
This measurement apperared to be the most intuitive and 
easily observed indication of how accurate the system 
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TABLE I 

Ten Queries Osed to Test the System 



Query Number 



Set 



Actual Size 



1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

NOTE ( 1) 



9-37.3) 



da(12-26) + temp (36. ^ 

Choi (1 15,277) * pred(3-20) 

patno{17-47) + da (17,47) 
temp (37.8, 40. 8) * male 

female ♦ chol ( 12 9 -245) 
patno(3,44) + da (30,95) 
patno (25,27) * da (13, 18) 

■ ’ + pred(22,30) 

* patno (44,4.7) 



+ pred(17,47) 



2 90 
340 

3 85 
8 

2 44 

3 27 

, . . * da (13, 18) 17 

chol(115-230) + pred(22,30) 435 

temp (36-8-37) « patno (44 ,4.7) 13 

pred(12,17) + da(19,20) 101 

The first query can be paraphrased as the set of 
all patients who either have a disease activity 
between 12 and 26 inclusive or have a temperature 
between 36.9 and 37.3 inclusive. The second query 
refers to the set ofall patients who have both a 
cholesterol level between115 and 277 inclusive and 
prednisone level between3 and 20 inclusive. 

NOTE (2) Query number 4 is also referred to as: 
hitemp * male 

and Query number 6 is referred to as; 
lowpat + hida 

when using the aliased database abstract. 



responses were. The second chciracteristic measured was the 
response time of the system for the given input query. This 
measurement used the PROLOG ’cputime’ function which in turn 
uses the Unix operating system time utility. Therefore the 
response time measurement may vary plus or minus several 
seconds depending on the user lead on the system. The test 
cases were all run under lightly loaded conditions to mini- 
mize the adverse effects of these discrepancies. Finally, 
the third characteristic measured as output was the storage 
requirements to hold the database abstract file. The PROLOG 
•consult’ function returns the size of the consulted file 
following a successful consult operation. This logical file 
size was* used because it represents the the storage required 
by the database abstract in the internal PROLOG database. 

Of course, a better system is determined by a tighter 
bound on accuracy, a shorter speed of execution, and a 
smaller storage requirement for the database abstract. The 
following three sections of this chapter will discuss each 
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of the variables in turn and describe how varying thee 
affected the system characteristics mentioned above. 

B. LEVEL OF KNOiLEDGE OF THE DATABASE ABSTRACT 

The ten queries of table I were used to test this factor 
by running each query five times, once for each level of 
knowledge. Figure 3.1 shows the graph of average system 
accuracy and speed of execution over the ten queries tested. 
This graph was prepared using a database abstract with all 
of the attributes partitioned into four divisions, however, 
the general shape of the accuracy curve is identical for all 
other granularities tested. This curve does show only the 
average accuracy and response time results of ten queries, 
however the reader can gain an appreciation for the general 
tendencies of the system accuracy and response time as the 
level of knowledge is varied even if no appreciation is 
gained for the sizable variance in responses among the 
queries tested. 

This graph shows the manner in which the accuracy of the 
system response as well as the speed of system execution 
increases as the higher levels of knowledge are used for the 
queries. (Note that an increase in accuracy of query 
respoi.se translates into a downward slope on the graph.) 
System accuracy increases at a relatively constant and 
rather slow rate for the first four levels, however the 
increase in accuracy takes a relatively sharp curve upward 
between levels four and five. One reason for this greater 
increase in accuracy is the fact that the queries tested 
included five unions and five intersection sets. Recall 
that the system did not have any rules for calculating new 
upper bounds for set unions for levels two, three, and four 
nor did the system have rules for calculating new lower 
bounds fo^ set intersections for the same three levels. 
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DATABASE: Medical Information Database 

QOEEIES SHOWN: Average of five unions and five inter- 

sections shown in table I 



Figure 3.1 Accuracy and Speed of Response 
for Five Levels of Knowledge. 

These levels for which news bounds could not be calculated 
merely used the level one bounds for that query. Therefore, 
It was not until level five was chosen that any given query 
could show an improvement over level one in both the upper 
and lower bounds simultaneously. 



53 



Also contributing to the relatively sharp increase in 
system accuracy when level five calculations were chosen is 
the fact that level five knowledge seems to represent a 
sharp increase in the amount of the original information 
content of the database retained by the database abstract. 
The lower level database abstracts contain information about 
the frequency of occurrence of certain values in each parti- 
tion set, but no information about the values to which these 
frequencies refer. The usefulness of comparisons between 
the database abstract information stored about component 
sets in a union or intersection is limited because there is 
no way of knowing to which values these frequencies refer. 
In some cases, worst case bounds must assume that the 
frequencies refer to the same value and in other cases the 
worst case assumption is that the sets are disjoint. 
Because bounds must be calculated using worst case assump- 
tions, the lack of information about the values to which the 
database abstract frequencies refer results in a serious 
degradation in accuracy of system responses. The tagged 
frequency lists of level five contain these values and asso- 
ciate them to the frequency with which they occur. This 
added tag information appears to raise the real information 
content of the database abstract to a plane much higher than 
the previous levels. These results raise the interesting 
question of whether an intermediate level of knowledge ought 
to contain the mode value as well as the mode frequency and 
the second most frequent value together with its frequency. 
Perhaps, rules more similar to the level five rules could be 
developed to improve the accuracy of the system responses 
without a significant degradation of system response time. 
This question is discussed in mere detail in chapter IV. 

The second curve of figure 3. 1 shows the steady increase 
in the time required to complete a system query as the level 
of knowledge is increased from cne to five. The increase in 
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this system characteristic is nuch more constant than the 
increase in system accuracy. This trait indicates that the 
jump in accuracy encountered in using level five information 
is not matched by a similar increase in response time. 
However, figure 3.2 shows that the storage area required for 
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Figure 3.2 Storage and Response Times for the 
various Level DBAs of the Medical Database. 
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the database abstracts for the various levels of knowledge 
also increases at a constant rate for levels one through 

five. The average response time for the ten queries is also 

shown on figure 3.2 to demonstrate that both of these char- 
acteristics increase at relatively the same rate as the 

level of knowledge is increased. Figure 3.3 shows a plot of 

storage required for the various database abstracts versus 
the average response times of the ten queries tested for the 
same database abstracts. This figure demonstrates that the 
increase in both storage requirements and response times is 
relatively constant as the level of knowledge is increased 
from one to five. We can conclude from this graph that the 
tradeoff made for greater accuracy at higher levels of 
knowledge results in comparable penalties of increased 
response times and database abstract storage requirements. 

From the preceding discussion, we can conclude that if 
the user needs relatively tight bounds on the size of the 
set being queried and is not overly concerned with the 
response time or the memory required to execute the query, 
level five is most definitely the optimum level of knowledge 
to use. However, if he is limited severely by the amount of 
memory available in which to store the database abstract, he 
may wish to choose to use a lower level of knowledge. With 
this option, however, he will suffer a large penalty in the 
accuracy of the answer calculated by the system. If memory 
is not a major concern, the user will obtain the most accu- 
rate results per second of response time using level five 
knowledge. 

C. GBANOLARITI OF THE DATABASE ABSTRACT 

^ • Varying Gran ular ity of All Attribute s 

The decision as to which level of knowledge to use 
for a given query is one which is made by the user. Because 
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Figure 3.3 Space 'Versus Time Plot 
for Various Granularities, 

all five levels are available tc the user at any given time, 
the user may choose one level fcr one query to obtain satis- 
factory results and a different level for a subsequent 
query. On the contrary, the decision as to the granularity 
of the database abstract is made by the database adminis- 
trator and cannot be changed unless the database adminis- 
trator recalculates the entire database abstract from the 
original database. Because this system is primarily 
designed to be used with very large databases or when access 
to the actual database is not available, this decision is 
much more permanent than the one discussed in the previous 
section. 
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The database abstract generation subsystem of DAQUES 
is capable of constructing a database abstract with a 
different granularity used for each of the numeric attri- 
butes. We will attempt here to discover what characteris- 
tics of a particular attribute tend to make a finer 
granularity more advantageous in terms of accuracy and to 
make that improved accuracy worth the penalty of increased 
system response time and increased storage requirements. 

Figure 3.4 shows the differences between the upper 
and lower bounds, averaged over the ten queries, for the 
five levels of knowledge. The different curves represent 
granularities of two, four, six, and eight partitions for 
all of the attributes in the sample database. These curves 
demonstrate that the accuracy cf results increases signifi- 
cantly as the granularity of all of the attributes is 
changed from two to four. However, the increase in accuracy 
as the granularity is shifted from four to six and from six 
to eight is less significant. Figure 3.5 shows that the 
increase in average system response time is fairly constant 
as the number of partitions in the database abstract is 
increased from two to eight. Figure 3.6 shows that the 
increase in storage required for the various database 
abstracts is relatively constant as the granularity is made 
finer. 

Figure 3.11 shows the storage requirements of the 
various mixes of the database abstract for levels of knowl- 
edge one through five. From this figure we can see that 
partitioning an attribute more finely results in a very 
predictable increase in the size of the database abstract. 
For levels one through three doubling the number of parti- 
tions for the numeric attributes nearly doubles the storage 
requirements of the database abstract. For levels four and 
five, doubling the number of partitions results in a more 
finely partitioned database abstract approximately one and 
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Figure 3.4 Accuracy for ?arious Granularities. 
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one half times the size of the more coarsely partitioned 
one. This smaller increase in the storage requirements at 
higher levels is due to the fact that many of the values in 
the frequency lists of these two levels are merely redistri- 
buted as the attibutes are partitioned more finely. 

From these curves we can conclude that increasing 
the number of partitions on all attributes in the database 
abstract causes a steady increase in the penalties of 
greater system response time and storage requirements but 
results in smaller and smaller improvements in the accuracy 
achieved. Therefore it appears that at a certain point the 
penalties incurred by dividing the database into more 
partitions is not worth the small increase in accuracy of 
query responses obtained. Although these conclusions have 
been drawn from experimentation on only one databse, it 
appears resonable to assume that the results could be gener- 
alized to other databases. Considering this restiction on 
the usefulness of increasing the granularity of all the 
attributes of the database abstract, it seems advantageous 
to investigate whether dividing only certain attributes 
more finely and leaving the others less finely partitioned 
will result in nearly the same accuracy improvements for a 
smaller penalty in response time and size of the database 
abstract . 

2. Varyi ng Granular i ty of ^ly One or Two Attributes 

To investigate the advantages of partitioning 
certain attributes more finely than others, the system was 
tested with seven different mixes of partitioning granu- 
larity. For each of the numeric attributes in the database, 
a database abstract was created with the given attribute 
divided into eight partitions and the remaining attributes 
divided into only four. Two additional database abstracts 
(mix and nmix) were also created, 'mix' dividing the temp 
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Figure 3.5 Response Tiues for Different Granularities. 

and chol attributes into eight partitions and the remaining 
numeric attributes into four partitions. The ’nrnix* data- 
base abstract reversed this construction resulting in four 
partitions for the temp and chol partitions and eight parti- 
tions for the other other numeric attributes. Table II 
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Figure 3.6 Storage Eeguirements of Hedical 
Information DBAs for Various Granularities. 
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provides an explanation of the granularities of the various 
attribute partitions in each of the database abstracts named 



TABLE II 

Breakdown of Various {fixes of DBAs 
DBA NA{fE NDMBEE OP PARTITIONS FOR THE GIVEN ATTRIBUTE 
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in the figures in 


this 


chapter. 


Figures 3.7 


through 3.9 


represent the accuracy results for 


three 


of the 


more inter- 



esting queries tested on this system. Each figure shows the 
actual upper and lower bounds calculated for all seven of 
these database abstracts together with the actual value of 
the query calculated from the original database. 

Careful study of these figures shows that improve- 
ments in accuracy were shown only when one of the components 
of the intersection or union set being queried involved the 
attribute that was partitioned more finely. For example, 
figure 3.7 shows that the query requesting the size of the 
set of all males with temperatures between 37.8 and 40.8 
inclusive resulted in one response from all database 
abstracts with the temperature attribute divided into only 
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four partitions and a different, more accurate response when 
utilizing the database abstract with the temperature attri- 
bute divided into eight partitions. This result is also 
present, but is less obvious in those queries where the 
queried set involves two numeric attributes. In figure 3.9, 
the query involves the patno and temp attributes and the 
database abstracts which partition these attributes more 
finely, tmix, pamix, mix and nmix all result in bounds which 
are tighter than the other database abstracts. However, it 
is interesting to note that the database abstracts which 
partition the patno attribute more finely result in a ach 
greater improvement in accuracy than those which parti .ion 
the temp attribute more finely. The same phenomenon is 
present in figure 3.8 with a finer partitioning of the pred 
attribute rather than the chol attribute resulting in the 
greater improvement. 

It is suspected that because the temp and chol 
attributes are highly skewed (i.e. have a very large mode 
frequency), increasing the number of partitions for these 
attributes from four to eight does not have the same ability 
to increase the information content of the database abstract 
as with more evenly distributed distributions. In the 
highly skewed distributions, one particular value, the mode, 
will always occupy only one partition, possibly along with 
several other values. Varying the granularity of the parti- 
tion cannot shift any of these mode values to separate 
partitions. Consequently, this very large partition, which 
contains a large percentage of the database tuples, will 
always be present and the amount of information that can be 
stored about it is independent of the granularity of the 
database abstract. Dividing the partition more finely for 
such attributes only results in increasing the information 
content of the remaining partitions which represent only a 
modest percentage of the entire database. 
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Figure 3.8 Accuracy of Query 8 Varying 
the Granularity or DBAs. 

Another reason for the increased accuracy of system 
response with database abstracts which are partitioned more 
finely is that the union of actual partition sets chosen to 
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Figure 3.9 Accuracy of Query 9 Varying 
the Granularity of DBAs. 

cover the input set defined by a range of attribute values 
is likely to fit that range more closely than it would using 
a database abstract partitioned more coarsely. Figure 3.10 
demonstrates this point graphiceilly . In this figure, we see 
the responses to system queries on an increasingly larger 
range of the numeric attribute, temperature. When the 
queried range is very small, say 40 to 40.8, the minimal 
union of actual partition sets for both database abstracts 
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consists of only one set. However, the one set for the 
finer partition is much smaller than that for the coarser 
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one. Therefore the calculated upper bound for the query, 
which is based solely on this minimal union, is much closer 
to the actual size of the set when a finer partitioning is 
used for the database abstract. In the same way, the calcu- 
lated lower bound is based on the maximal union of partition 
sets which is covered by the desired range. For a very 
small queried range, this union is the empty set, producing 
a lower bound of zero. However finer partitioning of the 
database abstract results in smaller partitons and conseq- 
uently a given queried range is more likely to cover such 
smaller sets than the larger ores resulting from a coarser 
partitioning. So, for a random range queried by the user, 
the finer the partitioning of the database abstract, the 
more closely the unions actually used for calculating the 
upper and lower bounds will be to the queried range and 
consequently the more accurate the results will be. Note 
that figure 3.10 shows the system calculating the same upper 
and lower bounds for all eight query ranges tested when the 
coarse database abstract was used. When the finer parti- 
tioning was selected, however, the system was able to calcu- 
late several different bounds and therefore was able to 
discriminate between these various queries tested. 

Returning to the three example queries of figures 
3.7 through 3.9, if we examine the response times for these 
calculations, we find that the level five calculation for 
the query shown in figure 3.7 ranged from 6.5 to 7.5 seconds 
for the database abstracts which divided the temp attribute 
into only four partitions, while response times of 18.6 to 
25.2 were experienced for those database abstracts with the 
temp attribute partitioned eight ways. The time 18.6 
resulted from the database abstract which only divided the 
temp attribute into eight partitions while the 25.2 time 
resulted from the database abstract which partitioned all 
the attributes eight ways. So it seems that increasing the 



granularity of even a single attribute in the database 
abstract will result in slower response times on the same 
order of magnitude as increasing the granularity of all the 
attributes. Note also that doubling the number of parti- 
tions in the database abstract results in considerably more 
than doubling the system response time. This characteristic 
is due to the fact that response time is not only a function 
of the size of the database abstract but also of the 
complexity of the input set. A set specified by a range 
such as: 

male * chol (115,270) 

will be preprocessed into form a) below for for a database 
abstract with four partitions and into form b) for a data- 
base abstract with eight partitions. 

a) and ([ male, or ([ chol (1) ,chol(2) ,chol (3) ]) ]) . 

b) and ([ male, or ([ chol (1) , chol (2) ,chol (3) , 

chol(4) ,chol(5) ,chol (6) ]) ]) . 

Obviously the set of form b) will require much longer to 
process than form a). Similarly, when only the queried 
attribute is partitioned more finely, the input set is still 
preprocessed into form b) and the complexity of the set 
causes the system response tine to approach the response 
time of the system when the database abstract with all 

attribute partitioned finely is used. If the query set does 
not involve the single attribute which is partitioned more 
finely, then the system response time is closer to the 
response time of the system using the databse abstract with 
all of the attributes partitioned more coarsely. However, 
keep ir mind that a finer partitioning of a single attribute 

will result in increased accuracy only for those queries 

which involve the attribute partitioned more finely. From 
the above discussion we can conclude that if the user 
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community needs additional accuracy and a single attribute 
predominates the system queries, it may be worth mixing the 
granularities of the various attributes in the database 
abstract to obtain the optimum accuracy at the minimum cost 
in system response time. 

Figure 3.11 shows the storage requirements for the 
database abstracts of table II. Inspection of this figure 
reveals that increasing the granularity for only one attri- 
bute in the database abstract results in a much smaller 
increase in the storage required than increasing the granu- 
larity of all the attributes. Therefore, if primary memory 
available is a major concern, dividing of only the most 
commonly used attributes into finer partitions may be the 
optimum method of constructing the database abstract. 

From the above paragraphs, we can conclude that 
there is no single optimum partitioning strategy which will 
be advantageous for all queries. However, if the database 
administrator is aware of a particular attribute which is 
involved in a large percentage cf the system queries, he may 
decide to divide that attribute into more partitions than 
the other attributes. For this reason, the query subsystem 
of the DAQOES system allows the user access to three 
distinct database abstracts for his queries. The intent of 
this arrangement is to provide a database which is rather 
coarsely partitioned as the ’normal' database abstract, 
resulting in rather loose upper and lower bounds but more 
rapid response. If the user chooses the 'alternate* data- 
base abstract, he will be using one in which all the attri- 
butes are partitioned more finely but the response time for 
each query will be much slower. Finally, the user may 
choose the 'special' database abstract which is partitioned 
finely on one or two commonly queried attributes and more 
coarsely on the remaining attributes. This database 
abstract will result in more accurate results than the 
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’normal’ database abstract for only those queries involving 
the ’special’ attribute or attributes and the response time 
will be less than that of the same query executed with the 
’alternate’ database abstract. Because the system only 
consults one of these database abstracts into the PROLOG 
internal database at a time, the user does not pay any added 
price in storage requirements other than the secondary 
storage required to hold the three separate database 
abstracts. Because secondary storage is usually readily 
available compared to primary storage in the form of the 
PROLOG internal database, this penalty seems a small price 
to pay for the added flexibility of this system. 

D. USE OF ALIASES TO IHPROVE SYSTEM ACCORACY AND RESPONSE 

TIME 

The previous section analyxed a method with which the 
database administrator could tailor the database abstract to 
the specific requirements of a particular group of users by 
partitioning more finely those attributes which are involved 
most frequently in the query sets of the users. This 
section analyzes another method of tailoring the database 
abstract to users by partitioning various attributes along 
boundaries which represent important sets in the application 
field and then assigning aliases to those partition sets in 
such a way that the aliases describe the sets. This method 
results in the database abstract which is the most specific 
to a class of users but also produces the most accurate 
results for queries on the special aliased sets. 

The method of defining special boundaries for meaningful 
partition sets of a particular attribute was described in 
Chapter II in the section on the database abstract genera- 
tion subsystem of DAQUES. The technique outlined in that 
chapter was used to define partitions of the patno, da and 
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temp attributes. assuming that the boundaries used repre- 
sented meaningful descriptions of low, middle and high 
ranges for those attributes. The remaining two numeric 
attributes were partitioned using the algorithm of the 
generation subsystem with no attempt to produce meaningful 
partition sets on those attributes. Figures 3.12 and 3.13 
show the query results of queries U and 6 of table I using 
the specialized database with only three partitions for each 
attribute. On separate curves are shown the results of the 
same queries using algorithmically generated database 
abstracts containing four and eight partitions. A compar- 
ison of these curves shows graphically the tremendous 
improvement in accuracy which results from using the alias 
technique. Because the specialized database abstract uses 
only three partitions for all the attributes, it requires 
much less storage than the database abstracts with combina- 
tions of four and eight partitions for the attributes. The 
system is able to evaluate queries on specific partition 
sets much more efficiently than it can evaluate queries on 
sets defined by arbitrary ranges of the attribute. This 
capability is caused by the method of substituting unions of 
actual partition sets for sets using specific ranges as 
described in chapter II. Therefore, the queries on the same 
sets executed much more rapidly with much more accurate 
results for this specialized database abstract than for the 
previous unspecialized database abstracts. These response 
times are demonstrated graphically in figures 3.14 and 3.15. 

E. COHCIDSIOHS 

From the tests discussed in the previous section we can 
now develop several conclusions about what the optimum data- 
base abstract would be. Recall that our goals for this 
system, mentioned in chapter I, were to determine the 
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optimum level of knowledge, and the optimum granularity of a 
database abstract for the query estimation system. Here, 
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optimum refers to the system with most accurate responses, 
the shortest response times and requiring the least amount 
of primary storage for the database abstract. Of course. 
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all of these goals cannot be realized at the same time 
because they conflict with each other. Therefore tradeoffs 
must be made to produce the system which best suits the 
needs of all the users. 

First and foremost, it is very important that the data- 
base administrator consider which attributes and sets are 
most commonly queried in the application field. If one 
particular attribute will be involved in a large percentage 
of the system queries, that attribute should be partitioned 
more finely than the other attributes. The question of 
exactly how finely the partitioning should be has not been 
determined through these tests. However it is suspected 
that there exists a granularity for each attribute which is 
optimum in the sense that partitioning the database more 
finely on this attribute results in a sharp drop in the 
improvement in accuracy achieved while the increase in 
system response time at least remains constant. It is also 
believed that this optimum granularity is a function of the 
ratio of the mode frequency of the attribute in the database 
to the size of the database. For example, a highly uneven 
distribution in which over half of the tuples of the data- 
base have the mode value will have a much lower optimum 
granularity than a distribution in which the values of the 
attribute are evenly distributed over the range of values of 
the attribute. Of course, the number of distinct values in 
the database for each attribute is an upper bound on the 
optimum granularity for that attribute because if each 
partition contained tuples with only one value for the given 
attribute, any finer partitioning on that attribute would 
not be possible. 

Another way which the database administrator should be 
familiar with the application field is to identify some 
partitions of database attributes which are meaningful in 
the application field and therefore are likely to be used 
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often in queries. The example used in chapter I was 
that of 3 low, normal and high temperature ranges. If 
such meaningful partitions are identified, the database 
administrator can define the database abstract partitions to 
correspond to the application area partitions of the attri- 
bute and result in greatly improved accuracy and system 
response time. 

The optimum granularity of the database abstract, there- 
fore depends on the needs of the user. If the database 
administrator must provide a system to serve the needs of a 
variety of users, he can use the normal, alternate and 
special database abstracts to provide one general purpose 
and two specialized database abstracts. If this flexibility 
is still not sufficient, the database administrator must 
make tradeoffs between the needs of his various users to try 
to best serve the entire user community. 

The question of the level of knowledge which must be 
used to obtain the optimum results for a particular query 
was determined not to be as important as the granularity of 
the database abstract. Because the user, rather than the 
database administrator chooses the level of knowledge to be 
used for each query, such a decision can easily be altered 
if the results obtained are not sufficiently accurate. 
However, the tests conducted indicated that either level one 
or level five were the optimum levels to use because these 
levels were capable of calculating new upper and lower 
bounds on both union and intersection sets. 

F. EFFECTS OF OSIHG PEOLOG TO IMPLEHENT THE SYSTEM 

It was mentioned in chapter II that PROLOG was chosen as 
the implementation language for this system primarily 
because it was expected that this language would provide a 
rule base which was more logical and easily understood than 
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that wtich a Lisp system might provide. This advantage was 
realized to even a greater extent than anticipated, but 
additional equally important advantages as well as several 
disadvantages to using PROLOG were also realized. Because 
this author’s experience with Lisp is quite limited, we will 
concentrate here on the characteristics of PROLOG rather 
than attempting to compare the two languages. 

The syntax of PROLOG (with some exceptions, e. g. I and 
=..) is extremely natural for a rule based system. Rules 
can be constructed individually and translated into PROLOG 
clauses independent of the other system rules. The form of 
a PROLOG clause lends itself to working on one level of 
abstraction at a time using lower level calls to rules whose 
effect is envisioned but whose implementation has been put 
off until later. The primary advantage to the PROLOG syntax 
is the way it lends itself tc organizing a program into 
several files of rules, each of which is understandable 
independent of the others. Fcr example, a reader could 
examine the ’infs’ file and understand the rules for 
obtaining lower bounds on the input sets without ever 
knowing that the ’sups’ or ’ests’ files exist. 

On the other hand, it is quite possible that a reader 
would need to refer to the ’master’ file to determine the 
effect of many low level functions which are used by the 
’infs’ file. This leads us to one of the disadvantages of 
the PROLOG system, that is the lack of many built in func- 
tions whose effect is already well understood by the user 
community. For example, the system does not have a ’member’ 
function to check for the membership of an individual 
element in a list or an ’append’ function to append two 
lists together. Even the most basic Lisp systems have such 
built in functions. On the other hand, it is quite easy to 
write these functions in PROLOG and, once written, they can 
be used over and over again in various systems. So, except 
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we do not see 



for the penalties of lower system efficiency, 
this disadvantage as too great a problem. 

In [Bef. 1], that author spent a great deal of his 
system’s execution time determining which rules in his rule 
base to apply and ensuring that all the appropriate rules 
were in fact used for every query. These problems were a 
direct result of using Lisp as his programming language. 
PROLOG solves this problem automatically through its pattern 
matching mechanisms. By providing a sufficient number of 
arguments in the upper level goals, the PROLOG system can 
ensure that when the appropriate values are substituted into 
these goals, only the clause heads of the desired rules will 
match the goal. Of those rules that match, only the first 
successful rule will affect the system variables. 
Therefore, the programmer can dictate the priority of execu- 
tion of his rules by merely placing the rules in the proper 
order in the rule base. PROLOG also provides built in 
features such as the ’cut’ predicate to alter the normal 
search through the rule base whereas in other programming 
languages, such mechanisms would have to be built from prim- 
itives. 

Another important way in whicn PROLOG lends itself to 
implementing a rule-based system is in its automatic manage- 
ment of the internal PROLOG database of rules and facts. 
The system in this paper is constructed as if two separate 
storage areas existed for the original database abstract and 
the system generated cache. In fact, these rules and facts 
are all stored in the PROLOG heap. New items can be added 
to the heap with the ’assert’ command and unwanted facts or 
rules can be removed using the ’retract’ and ’abolish’ 
commands. In this way, converting a set union into a simple 
set with respect to a given level is just a matter of 
asserting a new set of facts into the internal database. 
These mechanisms to automatically manage the internal 
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database eliminate many of the details which would have to 
be worked out to build such a system in another programming 
language. 

Chapter IV discusses in some detail the possible exten- 
sions to this system which could be effected to increase its 
functionality. These extensions are made possible primarily 
because of the independent execution of the individual 
PROLOG rules when these rules have distinct clause heads. A 
detailed explanation of this idea is deferred until chapter 
IV but suffice it to say here that this rule-based system 
could easily be extended to answer queries on statistics 
other than ’size* and on sets other than unions and inter- 
sections. This flexibility is caused primarily by the 
structure inherent in a PROLOG rule base. 

A final advantage of using PROLOG to implement this 
rule-based system may not be fully realized for several 
years. Kith no conscience effort to do so, this system is 
built to be executed very efficiently on a network of 
parallel processors. For example, if a parallel conjunc- 
tion were used between the subgoals 'dosup', ’doinf, and 
’doest*, these three independent calculations could be 
carried out in parallel and thus system response time could 
be reduced considerably. Even more dynamic would be the use 
of such a parallel conjunction in the ’map’ functional to 
cause parallel execution of the individual rules for all the 
attributes simultaneously on different processors. These 
two simple changes in the system would reduce system 
response time by about an order of magnitude. If a concen- 
trated effort were made from the start of system construc- 
tion, perhaps even greater improvements in response time 
could be realized. 

With all the above points considered, the decision to 
use PROLOG as the implementation language for this rule- 
based system appears to be a wise one. For a penalty of 
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slightly longer response times and more work to build basic 
functions, the PROLOG system provided the advantages of 
logical and easily understood syntax, automatic pattern 
matching mechanisms for rule application, easy-to-use tools 
for management of the internal database of rules, and a 
system which can easily be expanded to increase its func- 
tionality. Both of the disadvantages of PROLOG stated above 
are likely to be overcome in the near future when more wide- 
spread use of this language will bring about a larger 
library of built-in system functions and more research into 
implementing PROLOG in parallel architectures will dramati- 
cally improve system response times. 
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IV. POSSIBLE EXTEN SIOHS/IMPHOVEMENTS TO THE SYSTEM 



A. EXTENDING THE SYSTEM TO ACCEPT QUERIES ON OTHER 

STATISTICS 

The DAQUES system was written with many seemingly unnec- 
essary arguments being passed from clause to clause in the 
higher level rules. For example, in all three 'do' rules, 
only one argument is currently needed by the system, that is 
the set being queried. However, the rules also have places 
for passing statistic and attribute arguments. Currently, 
only rules for the 'size' statistic exist in the database of 
rules. All these rules have clause heads which appear like 
the following: 

do sup (size, and (Set_List) , _, Answer) 

or 

dosup (size, or (Set_Iist) ,_, Answer) 

Here, the third argument would contain the attribute of the 
statistic to be queried. Since the 'size' statistic is not 
associated with any particular attribute, this argument 
remains anonymous in the current system. 

In order to extend this system to accept queries on 
other statistics of the input set, the programmer would 
simply have to add a new file of rules to supplement the 
'sups', 'infs', and 'ests' files. For example, to allow the 
system to request upper and lower bounds and estimates on 
the mean value of a particular attribute in the queried set, 
the programmer would simply have to add rules with clause 
heads of the form: 

dosup (mean, and (Set_List) , At tribute. Answer) 

and 

doinf (mean, and (Set_List) , Attribute, Answer) 
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and 

doest (mean, and (Set_Lis t) , Attribute, Answer) 

Similar rules would have to be added for set unions. These 
new rules would in no way interfere with the current system. 
No hidden side effects from these new rules could possibly 
alter the execution of the current rules because the call to 
the appropriate ’do* rule would only match the rule in the 
rule base which had the proper statistic as its first argu- 
ment. Cf course, these new rules could use many of the 
already written general rules which are located in the ’mas- 
ter* file, thus saving the programmer much wasted effort. 

It is also anticipated that little or no degradation in 
system response time would be experienced by augmenting the 
system in this way. Once the initial invocation of the ’do’ 
clause chose the proper rule to use for the statistic being 
queried, the rules for the other statistics should have no 
effect on the speed of execution. Of course, the added 
rules would increase the space needed to hold the system in 
the PEOLCG internal database. If space became a problem, 
the programmer could sacrifice response time for added space 
by nly consulting the rules for the specified statistic and 
th . retracting these rules from the rule-base and asserting 
a different set of rules if a different statistic is 
queried. 

This system could be extended to accept queries on any 
statistic provided that the appropriate rules could be 
devised to obtain useful bounds and estimates on those 
statistics. Additionally, the system could be expanded to 
accept sets other than the unions and intersections which it 
currently accepts. For example rules with the following 
forms could be added to allow the system to calculate upper 
bounds on sets expressed as complements of simple sets: 

dosup (size, comp (Big set ,Set_List) , Answer) 

and 
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do sup (std_dev ,comp (Big set , Se t_List) , Attribute , Answer) 

Here the notation signifies that the desired query set is 
made of all tuples from the Bigset which are not also 
members of any of the sets in the Set_List. 

The important point of this section is not to offer 
specific suggestions as to how to extend the current system 
but to demonstrate how the flexibility of the current system 
lends itself to extension. A large part of this flexibility 
is made possible by the pattern matching ability of the 
PROLOG language. To construct such a prototype system with 
the same capability for expansion in a conventional impera- 
tive programming language would be practically impossible. 

B. IMPROVEMEHTS IH SYSTEM ACCOHACI 

It was discussed in chapter II that a significant 
improvement in system accuracy could be achieved if the 
rules for determining upper and lower bounds on the levels 
two through four information for set unions could be 
improved. Remember that whenever a set is processed by a 
*do' rule, the rule must make that set into a simple set 
with respect to the current level of knowledge by asserting 
into the internal database upper and lower bounds on the 
same level information which is stored about the actual 
partition sets. After a set union is processed by the level 
two rules, the upper bound on the mode frequency and number 
of distinct values which has been asserted into the PROLOG 
internal database is the sum of these statistics for each of 
the component sets in the union. This is a terrible bound 
on these statistics and when this union set is a component 
set in some higher level union or intersection, the inaccu- 
racy of these bounds is passed on to the calculation of the 
bounds on the size of the higher level set. Because of the 
method for converting input sets specified by a range of 
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values of an attribute into disjoint unio of partition 
sets of the attribute, a new tighter bound on the calcula- 
tion of the lower level statistics would significantly 
improve the accuracy of the bounds calculated for the size 
of the higher level set. 

Two general methods for improving the accuracy of the 
bounds on these level information statistics seem promising. 
The first is to test the lower level union to detect 
disjointness and to handle such disjoint sets differently 
than sets which are not disjoint. The problem with this 
approac-i is that it only shows an improvement for the parti- 
tioning attribute's calculations. For the other attributes, 
determining that the sets have no tuples in common does not 
guarantee that they do not have any values of the non- 
partitioning attribute in common. For example, in the 
sample medical database, if we were dealing with the two 
partition sets patno(l) and patno(2), we could state with 
confidence that the mode frequency patno of their union was 
the maximum of the mode frequencies for patno in the two 
component sets because these sets clearly are disjoint. 
However, consider the cholesterol attribute. The mode value 
for cholesterol in both of these sets is the same even 
though the sets are disjoint. Therefore, if we used the 
same rule for disjoint sets, the calculated upper bound on 
the mode frequency of the union would be much less than the 
actual mode frequency. 

The root of the problem in the above paragraph is the 
fact that for the non-partitioning attribute in the compo- 
nent sets of a union or intersection, we cannot determine 
whether the mode values of the sets are the same when we are 
only provided with the mode frequency. The second approach 
to im oving accuracy of these calculations is to augment 
the database a^^stract with the actual mode value in addition 
to the other statistics. This approach would increase the 
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size of the database abstract ty about one third but prob- 
ably would increase the information content of the database 
abstract by much more. 

C, IHPfiOVEMERTS IH SYSTEH EFFICIENCY 

As stated in chapter I, speed of execution of the system 
was not a primary ccncern during its construction. The 
purpose of a prototype system is not to produce the most 
efficient system but to prove the feasibility of the 
approach. Consequently, there is much room for improvement 
in the efficiency of this system. First of all, the exper- 
tise of the author in developing tight PROLOG code (if such 
a thing exists) , matured a great deal from the early to the 
late stages of the program. Many rules exist in the 
program, in which a cut predicate might greatly improve the 
speed of execution of the system. These cut predicates have 
been inserted in many rules but no determined effort has 
been made to optimize the code in this way. 

Another area which could increase the efficiency of the 
program is a shift in the overall structure of the highest 
level predicates. As stated many times previously, part of 
the purpose of every ’do’ rule is to produce the side effect 
of converting the input set into a simple set with respect 
to the current level of knowledge in case this set is a 
component set in a higher level union or intersection. This 
system inherently wastes a great deal of execution time 
calculating upper and lower bounds on levels two through 
five statistics for the highest level sets. 

One solution to this wastefulness would be to provide 
different sets of rules to deal with the highest level sets 
and the lower level sets. For example, the ’dosup21ow’ rule 
would calculate the upper bound on the input set as well as 
upper bounds on the mode frequency and number of distinct 
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values of the input set because it is certain that this set 
is part of a higher level union or intersection. However, 
the ’dosup2hi' rule would only calculate the upper bounds on 
the size of the input set because this rule is only called 
for the highest level sets. These rules could share many of 
the lower level rules which they call in order to minimize 
the increase in the size of the rule base for this revised 
system. 



SOflHABY 



The DAQUES system is designed primarily to demonstrate 
how the user of a database can be provided with a hierarchy 
of knowledge to produce increasingly more accurate bounds 
and estimations for his queries. A primary goal of the 
system is to provide the most rapid response to users’ 
queries possible, with the most accurate results possible 
when a user may not require exact answers to his queries or 
may not have access to the actual database containing that 
information. The improvements to the basic DAQUES system 
discussed in this chapter will help the system to better 
achieve these response time and accuracy goals. 
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HE ASSERT CLAUSES ASSERT EMPTY LISTS FOR EVERY PARTITION SET OF EVERY 
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